This lab will be submitted in pairs using GitHub (if you don’t have a pair, please contact us).
Please follow the steps in the GitHub-Classroom Lab 1 to create your group’s Lab 1 repository.
Important: your team’s name must be FamilyName1_Name1_and_FamilyName2_Name2.
You can collaborate with your partner using the git environment; You can either make commits straight to the main repository, or create individual branches (recommended). However, once done, be sure to merge your branches to master - you will be graded using the most recent master version - your last push and merge before the deadline of May 15th.
Please do not open/review other peoples’ repositories - we will be notified by GitHub if you do.
Your final push should include this Rmd file (with your answers filled-in), together with the html file that is outputted automatically by knitr when you knit the Rmd. Anything else will be disregarded. In addition, please adhere to the following file format:
Lab_1_FamilyName1_Name1_and_FamilyName2_Name2.Rmd/html
The only allowed libraries are the following (please do not add additional libraries without permission from the course staff):
The world-of-data website hosts world-wide epidemiological data on the Corona Virus (COVID-19). The dataset is compiled by the Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE) from various sources, and follows The dataset contains data since January 2020. For the data and more information about it, please visit here.
You can see several nice visualizations of the data here
we will focus on analyzing the Covid-19 cases, deaths and vaccinations data over time for different countries and continents.
Your solution should be submitted as a full Rmd report integrating text, code, figures and tables. For each question, describe first in the text of your solution what you’re trying to do, then include the relevant code, then the results (e.g. figures/tables) and then a textual description of them.
In most questions the extraction/manipulation of relevant parts of the data-frame can be performed using commands from the tidyverse and dplyr R packages, such as head, arrange, aggregate, group-by, filter, select, summaries, mutate, order etc.
When displaying tables, show the relevant columns and rows with meaningful names, and describe the results.
When displaying figures, make sure that the figure is clear to the reader, axis ranges are appropriate, labels for the axis , title and different curves/bars are displayed clearly (font sizes are large enough), a legend is shown when needed etc. Explain and describe in text what is shown in the figure.
In many cases, data are missing (e.g. NA). Make sure that all your calculations (e.g. taking the maximum, average, correlation etc.) take this into account. Specifically, the calculations should ignore the missing values to allow us to compute the desired results for the rest of the values (for example, using the option na.rm = TRUE).
Grading: There are \(10\) questions overall. Each question is worth \(10\) points for the lab grade. The questions vary in length and difficulty level. It is recommended to start with the simpler and shorter questions.
csv format from world-of-data into a data-frame in R.date variable to Date and check that the class is correct.explantion: we save the data set by “covid-19 name” and than changed class of date to “date”
covid_19 = read.csv("/Users/shaharzismanovich/Downloads/r52414_2021_22_lab1-zismanovich_shahar_and_honen_shahaf-main/owid-covid-data.csv")
covid_19$date = as.Date(covid_19$date)
print(class(covid_19$date))
## [1] "Date"
b. List in a table the top five *dates* in terms of number of `new_cases` for `High income` countries. Show only the date and the number of new cases at this date. <br>
Repeat the same with two additional separate tables for top five dates for new_deaths and new_vaccinations.
explanation: we used pipe function to filter the relevent details that the question deamend and then we used also the top n func, that bring us the top relevent number of data set by defining one of the columns.
high_income = subset.data.frame(covid_19, covid_19$location == 'High income')
high_income %>% top_n(5, new_cases) %>% select(new_cases, date)
## new_cases date
## 1 2589456 2022-01-10
## 2 2737979 2022-01-18
## 3 2830635 2022-01-19
## 4 2559027 2022-01-21
## 5 2532198 2022-01-26
high_income %>% top_n(5, new_deaths) %>% select(new_deaths, date)
## new_deaths date
## 1 9967 2021-01-08
## 2 10170 2021-01-12
## 3 11047 2021-01-20
## 4 10279 2021-01-26
## 5 10079 2021-01-27
high_income %>% top_n(5 ,new_vaccinations) %>% select(new_vaccinations, date)
## new_vaccinations date
## 1 9453347 2021-06-09
## 2 9539660 2021-06-10
## 3 9485687 2021-06-11
## 4 11981277 2021-08-04
## 5 9751526 2021-12-17
High income and Low income countries, shown on the same graph with different colors or symbols. Use meaningful axis and plot labels, and add an informative legend. NA or other no-number values should not be displayed.2a explanation - we vreated function that creating subset by 2 parameters- high or low income, and then making plot that definding bewtween them by the relevent column the user should insert. In the plot we used ggpolt func.
low_vs_high_comp<- function(d_frame, column){
high_income_f = subset.data.frame(d_frame, d_frame$location == 'High income'| d_frame$location == 'Low income')
ggplot(high_income_f, aes(x = date, y= get(column))) +
geom_point(aes( color= location)) +
xlab("date")+
ylab(column)
}
b. Use the function written in (a.) and plot of the number of `new_cases_per_million` vs. date for the high vs. low income countries.
2.b exp- In this task we used the func we were create to make a graghic equation between high and low income location by diffrent column. x- axis show us the date and y axis the number of each relevent column. to calculate the log deatils we created new column to the data set that oparate log function on the relevent column.
low_vs_high_comp(covid_19, "new_cases_per_million" )
covid_19$log_new_cases_smoothed_per_million = log(covid_19$new_cases_smoothed_per_million)
Next, make a similar plot for the log of the smoothed number of new cases per million, new_cases_smoothed_per_million.
low_vs_high_comp(covid_19, "log_new_cases_smoothed_per_million" )
## Warning: Removed 10 rows containing missing values (geom_point).
Which plot is easier to interpret? explain.
Answer- the easiest graph to analize is the log one- beacuse the behavior of the functions are same and its easy to see the diffrent in each date on the graph, They are look the same so we can analyse them same Similarly, make two additional separate plots for the log of the smoothed number of new deaths and new vaccinations per million as a function of date for the high vs. low income countries. Describe the plotted results.
describing the grahs.- as we can see, high income location died more(by log func) and did more vaccinations.
covid_19$log_new_deaths_per_mill = log(covid_19$new_deaths_per_million)
low_vs_high_comp(covid_19, "log_new_deaths_per_mill" )
covid_19$log_vac_per_mill = log(covid_19$new_vaccinations_smoothed_per_million)
low_vs_high_comp(covid_19, "log_vac_per_mill")
## Warning: Removed 680 rows containing missing values (geom_point).
current with one row per country (and other locations), that for each country will store as columns the country name (location) and continent, and also the current values (latest date reported for each value) for: `,total_deaths_per_million,total_vaccinations_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundredandexcess_mortality_cumulative_per_million`.current <- subset(covid_19, select = c(location, date,total_vaccinations_per_hundred, people_fully_vaccinated_per_hundred, total_boosters_per_hundred, excess_mortality_cumulative_per_million, continent, total_deaths_per_million, total_cases_per_million ))
#creating new DS
current <- fill(current,everything())
#fixing the N.A
current <- subset(current, date == max(date), c("location", "continent","total_deaths_per_million", "total_vaccinations_per_hundred", "people_fully_vaccinated_per_hundred", "total_boosters_per_hundred", "excess_mortality_cumulative_per_million", "total_cases_per_million" ))
#comulate the the details to one row for each country.
head(current)
## location continent total_deaths_per_million
## 792 Afghanistan Asia 192.869
## 1595 Africa 184.074
## 2386 Albania Europe 1216.874
## 3177 Algeria Africa 154.091
## 3962 Andorra Europe 1977.920
## 4729 Angola Africa 55.992
## total_vaccinations_per_hundred people_fully_vaccinated_per_hundred
## 792 14.93 11.56
## 1595 34.53 16.16
## 2386 97.51 42.68
## 3177 30.72 13.70
## 3962 196.89 69.04
## 4729 55.28 18.06
## total_boosters_per_hundred excess_mortality_cumulative_per_million
## 792 NA NA
## 1595 1.48 NA
## 2386 9.97 5101.057
## 3177 1.10 1118.164
## 3962 53.06 1158.311
## 4729 0.98 1158.311
## total_cases_per_million
## 792 4487.086
## 1595 8477.007
## 2386 95661.091
## 3177 5956.770
## 3962 530198.826
## 4729 2925.919
for question 3c we added to current the column of total_cases_per_million.
b. Show the values for the current `total_deaths_per_million` in different countries in a histogram with 30 bins. Does this histogram look close to the normal distribution?
Compute the skewness and kurtosis for this distribution, and explain what they mean about the empirical distribution of the data.
skewness(current$total_deaths_per_million, na.rm = T)
## [1] 1.306257
histo_deahth<- hist(current$total_deaths_per_million, breaks=30)
kurtosis(current$total_deaths_per_million, na.rm = T)
## [1] 1.642502
Skewness value are possitive so we have right distortion kurtosis showing us th level of outliers and here is 1.642 .
dosnt look like normal dist.
c. Next, make a scatter plot showing the current `total_deaths_per_million` (y-axis) vs. the current `total_cases_per_million`. Compute a linear regression line of the total number of deaths per million as a function of the total number of cases per million and add the fitted regression line to the plot. What is the slope and what does it represent?
plot(current$total_cases_per_million, current$total_deaths_per_million, xlab = "Total cases per million", ylab = "Total deaths per million")
reg_line = lm(current$total_deaths_per_million ~ current$total_cases_per_million)
abline(reg_line, col = "red")
summary(reg_line)
##
## Call:
## lm(formula = current$total_deaths_per_million ~ current$total_cases_per_million)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2821.6 -585.0 -348.5 519.1 5350.3
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 606.5936124 95.5122157 6.351
## current$total_cases_per_million 0.0039429 0.0004682 8.421
## Pr(>|t|)
## (Intercept) 0.00000000116178243 ***
## current$total_cases_per_million 0.00000000000000433 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1055 on 226 degrees of freedom
## Multiple R-squared: 0.2388, Adjusted R-squared: 0.2355
## F-statistic: 70.92 on 1 and 226 DF, p-value: 0.000000000000004325
‘’As we can see visually, there is a weak positive correlation between the number of cases and number of deaths. We can take the R^2 value we are exposed to in the summary function and by the root of it the get the corr between the two variables.’’
Africa, Asia, Europe, North America, Oceania, South America), make a boxplot of the distribution of the (current) total number of vaccinations per hundred in all the countries in the continent. Show one figure with the six boxplots next to each other. (Use the new current data-frame created in qu. 3). Find two outlier countries (can be of any continent) and write their name and value.# Using box plot func
box_continent = current [, c("continent","total_vaccinations_per_hundred")] %>% filter(continent != "")
a = boxplot(total_vaccinations_per_hundred ~ continent ,
data= box_continent,
main="Disterbution by continent",
xlab="continet",
ylab="total vac per 100",
col="deepskyblue4",
border="brown"
)
#bring 2 outlier counries
current %>% top_n(2, total_vaccinations_per_hundred)
## location continent total_deaths_per_million
## 1 Cuba North America 753.258
## 2 Gibraltar Europe 3027.515
## total_vaccinations_per_hundred people_fully_vaccinated_per_hundred
## 1 315.20 87.84
## 2 355.75 122.94
## total_boosters_per_hundred excess_mortality_cumulative_per_million
## 1 57.66 43.75525
## 2 107.92 581.75774
## total_cases_per_million
## 1 97368.52
## 2 525540.95
b. Define (for the original data-frame from qu. 1) a new column called `booster_ratio`, that lists for each date the fraction of individuals that got a third, booster shot in a country (`total_boosters`), out of all individuals that got two shots (`people_fully_vaccinated`), by dividing the two columns (if either of them is `NA` or if the denominator is zero, set `booster_ratio` to `NA`).
Plot the booster_ratio as a function of time for the six continents (on the same plot, using different colors) and describe the results.
#creating new col and also avoiding dividing in zero
covid_19$booster_ratio = ifelse(covid_19$people_fully_vaccinated > 0, covid_19$total_boosters / covid_19$people_fully_vaccinated, NA)
box_continent_2 = covid_19 [, c("continent","booster_ratio", "date")] %>% filter(continent != "")
#Using smooth ggplot to make the graph more clear to veiw.
ggplot(box_continent_2, aes(x = date, y= booster_ratio)) +
geom_smooth(aes( color= (continent)))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 154690 rows containing non-finite values (stat_smooth).
#As we can see most of the it takes time until the ratio upgrade, and it happend in all the continents, Africa stay behind buy has also some raising.
month_ds = subset(covid_19, date> "2019-12-31" & date < "2022-04-01" )
month_ds = month_ds[,c("location", "date", "new_cases_per_million" )]
month_ds = month_ds %>% group_by(date= format(as.Date(date),"%y-%m")) %>% group_by(date,location) %>% summarise(new_cases_per_million_month = sum(new_cases_per_million))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
boxplot( new_cases_per_million_month ~ date ,
data= month_ds,
main=" Empirical distribution fo these",
xlab="Month",
ylab="new cases",
col="green",
border="orange"
)
*Guidance:* (i) Beware to not double-count cases/deaths/vaccinations. (ii) Treat each month separately (e.g. March 2020 and March 2022 are different).
b. Repeat (a.), but this time with the total number of `new_deaths` and `new_vaccinations` for each month (two separate plots).
month_ds_death = subset(covid_19, date> "2019-12-31" & date < "2022-04-01" )
month_ds_death = month_ds_death[,c("location", "date", "new_deaths" )]
month_ds_death = month_ds_death %>% group_by(date= format(as.Date(date),"%y-%m")) %>% group_by(date,location) %>% summarise(new_deaths_month = sum(new_deaths))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
boxplot(new_deaths_month ~ date ,
data= month_ds_death,
main=" new deaths month dist",
xlab="Month",
ylab="new deaths",
col="blue",
border="black")
month_ds_vacc = subset(covid_19, date> "2019-12-31" & date < "2022-04-01" )
month_ds_vacc = month_ds_vacc[,c("location", "date", "new_vaccinations" )]
month_ds_vacc = month_ds_vacc %>% group_by(date= format(as.Date(date),"%y-%m")) %>% group_by(date,location) %>% summarise(new_vacc_month = sum(new_vaccinations))
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
boxplot(new_vacc_month ~ date ,
data= month_ds_vacc,
main=" new vacc per month dist",
xlab="Month",
ylab="new vacc in a month",
col="blue",
border="black")
What can you conclude about the pandemic at different periods from these plots? describe the results for each of the three plots in 2-3 sentences.
WE can see that between MAy to January it was increase in deatts rate in corelation in with vaccination rate.
R_cases, defined for location and date as the number of new_cases_smoothed at this date, divided by the value of the same variable seven days before (if the value seven days before is zero or not defined, define R_cases at the current date to be NA). This column represents roughly the expected number of people that each case infects, and determines the spread of the disease, with values above (below) one indicating that the disease is spreading (declining). r_case_df = covid_19[,c("location", "date", "new_cases_smoothed_per_million")]
r_case_df = r_case_df %>% group_by(location) %>% summarise(R_cases = new_cases_smoothed_per_million / lag(new_cases_smoothed_per_million, n=7), date)
## `summarise()` has grouped output by 'location'. You can override using the
## `.groups` argument.
r_case_df$R_cases[is.infinite(r_case_df$R_cases)] = NA
covid_19$R_cases = r_case_df$R_cases
Plot the `R_cases` value as a function of time for `Israel`, `United Kingdom` and `United States`, and describe the results. <br>
List in a table the number of days at which the disease was spreading (value above 1) in each of the three countries.
country_data = covid_19 [, c("location", "date", "R_cases")] %>% subset( location == "Israel" |location == "United States"| location == "United Kingdom")
ggplot(country_data, aes(x = date, y= R_cases)) +
geom_line(aes( color= (location)))
## Warning: Removed 38 row(s) containing missing values (geom_path).
country_data <- fill(country_data,everything())
country_data_2 = country_data %>% group_by(location) %>% filter(R_cases >=1)
r_isr = length(which(country_data_2$location == "Israel"))
r_US = length(which(country_data_2$location == "United States"))
r_UK = length(which(country_data_2$location == "United Kingdom"))
table_r = c(Israel = r_isr, UK = r_UK, US = r_US)
table_r
## Israel UK US
## 403 446 420
Displaying data on the world map: Use the rworldmap package to display the world map and color each country based on the total_deaths_per_million. Repeat for total_vaccinations_per_hundred, and excess_mortality_cumulative_per_million. Describe the resulting maps in a couple of sentences.
List the top three countries for each of these variables in a table.
Guidance: Use the joinCountryData2Map and mapCountryData commands to make the plots. Keep countries with missing data in white.
world_current = joinCountryData2Map(current, joinCode = "NAME", nameJoinColumn = "location")
## 206 codes from your data successfully matched countries in the map
## 22 codes from your data failed to match with a country code in the map
## 37 codes from the map weren't represented in your data
mapCountryData(world_current, nameColumnToPlot = "total_deaths_per_million")
current %>% top_n(3, total_deaths_per_million) %>% select(location, total_deaths_per_million )
## location total_deaths_per_million
## 1 Bosnia and Herzegovina 4829.845
## 2 Bulgaria 5344.040
## 3 Peru 6377.840
mapCountryData(world_current, nameColumnToPlot = "total_vaccinations_per_hundred")
current %>% top_n(3, total_vaccinations_per_hundred) %>% select(location, total_vaccinations_per_hundred )
## location total_vaccinations_per_hundred
## 1 Chile 271.92
## 2 Cuba 315.20
## 3 Gibraltar 355.75
mapCountryData(world_current, nameColumnToPlot = "excess_mortality_cumulative_per_million")
current %>% top_n(3, total_vaccinations_per_hundred) %>% select(location, excess_mortality_cumulative_per_million )
## location excess_mortality_cumulative_per_million
## 1 Chile 2563.12576
## 2 Cuba 43.75525
## 3 Gibraltar 581.75774
Cross correlations and delay from diagnosis to death: We want to use the data and cross-correlation in order to study the typical time delay between diagnosis of Covid-19 and death from Covid-19 for cases not surviving the disease. For two functions of time \(X(t)\) and \(Y(t)\) (here \(t\) is discrete, representing for example days) we define their cross-correlation at time-delay \(\Delta_t\) as follows: \(cross_{corr}(\Delta_t ; X, Y) = Corr(X(t), Y(t+\Delta_t))\).
That is, the cross-correlation function at the time-delay \(\Delta_t\) for two vectors of length \(n\) is obtained by computing the Pearson correlation coefficient of the vector \(X[1,...,n-\Delta_t]\) with the vector \(Y[\Delta_t+1,...,n]\), for \(\Delta_t>0\). For \(\Delta_t < 0\) we replace the role of \(X\) and \(Y\) in this formula.
cross_corr <- function(corr_df, country, col1, col2){
new_vec <- c() #creating new vector for the cross correlations we will calculate
filter_var <- corr_df%>%filter(location == country)
col1 <- as.numeric(as.list(t(filter_var[,col1])))
col2 <- as.numeric(as.list(t(filter_var[,col2])))
# Defining the columns as Transpose in order to calculate the corr, setting the country for each one.
for (i in c(-60:60)) {
if (i< 0){new_vec <- append(new_vec, cor(col1[(abs(i) + 1):length(col1)], col2[1:(length(col1)-abs(i))], use = "pairwise.complete.obs"))}
else if (i>=0){new_vec <- append(new_vec, cor(col2[(i+1):length(col1)], col1[1:(length(col1)-i)], use = "pairwise.complete.obs"))}
#corr(x[1,...1-delta(t)],y[1+delta(t),..n]), and omitting NA's.
}
return(new_vec)
}
World_cross_corr <- cross_corr(covid_19,"World","new_cases_smoothed","new_deaths_smoothed")
b. Use the function from (a.) to compute the cross correlation between the number of `new_cases_smoothed` and `new_deaths_smoothed` for the entire *World*, and plot it as a function of $\Delta_t$.
At what time delay is the cross correlation maximized? what is your interpretation of this time-delay?
World_cross_corr <- cross_corr(covid_19,"World","new_cases_smoothed","new_deaths_smoothed")
interval <- (c(-60:60))
plot(x=interval,y=World_cross_corr ,xlab = "Delta",ylab = "Cross corr",main = "Cross-Correlation for the world vs.Delta")
death_rate, defined for each location and date as the number of total_deaths divided by the number of total_cases. This column represents the risk of a person diagnosed with covid to die from the disease. Plot for each of the six continents and the entire world the death_rate as a function of time (one plot with separate colors/symbols). Since there is a delay from time of diagnosis to time of death, the initial values of this column are less reliable as a measure of death risk, hence start your plot on January 1st, 2021. Do we see a decrease in the risk over time? can you suggest explanations for the observed trends?covid_19$death_rate = ifelse(covid_19$total_cases >0, covid_19$total_deaths / covid_19$total_cases, NA)
box_continent_2 = covid_19 [, c("continent","death_rate", "date")] %>% filter(continent != "")
box_continent_2 = subset(covid_19, date> "2020-12-31" & date < "2022-04-01" )
ggplot(box_continent_2, aes(x = date, y= death_rate)) +
geom_smooth(aes( color= (continent)))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11503 rows containing non-finite values (stat_smooth).
There is a huge decrease between those dates. maybe we can asume its a wave of corona that just passed or there is an influence by another parameter like vaccination.
b. Make a similar plot for all continents and the world, but for the `total_vaccinations_per_hundred` variable. Do the plots suggest that the change in risk is correlated to the change in the number of vaccinations?
box_continent_3 = subset(covid_19, date> "2020-12-31" & date < "2022-04-01" )
ggplot(box_continent_3, aes(x = date, y= total_vaccinations_per_hundred)) +
geom_smooth(aes( color= (continent)))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 60832 rows containing non-finite values (stat_smooth).
as we can see there is negative correlation between vaccinations and deaths. as much as more people get the vaccinations, less people dying.
current data-frame to make a scatter plot of the current date total_deaths_per_million vs. the excess_mortality_cumulative_per_million for all countries for which excess mortality data is available. Add the lines \(y=x, y=x+2000\) and \(y=x-2000\) to the plot. Mark on the graph in a different color all the countries for which the difference between the the excess mortality and the covid death rate (per million) is at least 2000 and add their names to the plot (you can use the text function).ggplot(current, aes(x=total_deaths_per_million, y=excess_mortality_cumulative_per_million)) + geom_point(aes(color = (abs(excess_mortality_cumulative_per_million-total_deaths_per_million)>=2000)), show.legend = F) + geom_abline(intercept = 0, slope = 1) + geom_abline(intercept = 2000, slope = 1, col = "gray") + geom_abline(intercept = -2000, slope = 1, col = "brown") + geom_text(aes(label=ifelse(abs(excess_mortality_cumulative_per_million-total_deaths_per_million)>=2000, location, " ")), size = 2.5)
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_text).
b. Pick three countries where the excess mortality was at least 2000 per million *above* the covid death rate, and that have at least $50$ available data points for each for `excess_mortality_cumulative_per_million`.
Use the main covid data-frame to plot as a function of time both the total_deaths_per_million and the excess_mortality_cumulative_per_million for each country (one plot - use different colors/symbols). Identify from the plot for each country the time periods where most deaths not explained by Covid-19 occurred.
ex_mor <- subset(covid_19, location == c("Lithuania","", "Romania", "Bulgaria") , select = c("location", "date", "death_rate", "total_deaths_per_million","excess_mortality_cumulative_per_million"))
## Warning in location == c("Lithuania", "", "Romania", "Bulgaria"): longer object
## length is not a multiple of shorter object length
ex_mor_plot <- ggplot(data = ex_mor) + geom_smooth(aes(x = date, y = excess_mortality_cumulative_per_million, colour = location)) + geom_smooth(aes(x = date, y = total_deaths_per_million, colour = location))
ex_mor_plot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 508 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 14 rows containing non-finite values (stat_smooth).